COMPARISON OF THREE DIFFERENT ITEM
BANKING METHODS FOR LONGITUDINAL TEST EQUATING

Introduction

Classical test theory has been found to have a number of deficiencies
(Hamilton et al., 1978). One such shortcoming is in the area of item
difficulty and item discrimination statistics. In classical test theory these
statistics are not invariant across groups of examinees of different ability.
That is, the item statistics gathered are useful only for populations similar
to those on which the statistics were gathered. In addition, comparisons of
examinees on an ability trait that can be measured by a test are limited to
situations where the tests used are either the same or parallel. Also,
examinee performance on a test item cannot be predicted, and testing
problems such as test design, item bias, and test equating have found no
adequate solutions.

In contrast to classical test theory, the Rasch one-parameter latent-trait
model has item parameters which are said to be sample invariant
(Wright, 1977). Because item parameters from different calibrations are
linearly related, they can be put onto a single common scale and "the
measures implied by scores on all such tests are automatically equated and
no further collection or analysis of data is needed" (Wright, 1977, p. 106).
Moreover, because all the items are on a common scale, they can be used
to make up new tests which would be equated on the common scale.



However, with each equating there is a standard error of equating
which, although small in comparison to the standard error of measurement
(Angoff, 1971; Bryman, 1976), is transmitted to every score. The
cumulative effect of this error can become quite large in a longitudinal
testing program. In addition, when equating tests using items calibrated by
means of the Rasch latent-trait model, the items used as the common core in
the linking process will have two difficulty parameter values: one from the
previous testing (its value in the item bank prior to testing) and a new
value from the calibration on the form to be equated to the existing bank.
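
How equating error can accumulate over a chain of links may be illustrated
with a short sketch. It assumes that the errors of the individual links are
independent, so that their variances add; that assumption, the function name,
and the per-link values are illustrative only and are not taken from the
study.

```python
import math


def cumulative_equating_se(link_ses):
    """Standard error accumulated over a chain of links.

    Assumes the per-link equating errors are independent, so variances add.
    This is an assumption of the sketch, not a result from the study.
    """
    return math.sqrt(sum(se * se for se in link_ses))


# Hypothetical per-link standard errors in logits (not study data).
chain = [0.02, 0.02, 0.02, 0.02]
print(round(cumulative_equating_se(chain), 3))
```

Under this assumption the accumulated error grows with the square root of
the number of links when every link is equally precise.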


The procedure that should be utilized in maintaining an item bank has
been the subject of some controversy. Researchers from 1978 to the
present disagree as to which difficulty calibrations should be in the bank:
the original, the most recent, or some combination. The reason given to
keep the original calibrations in the bank (Mead, 1981) is that these
calibrations act as an anchor. On the opposite side, the argument
presented (Rentz, 1978; Cook, Eignor, Peterson, 1982) is that if changes in
learning/teaching have occurred, the item calibrations should reflect them.
Thus, this viewpoint advocates that item difficulty calibrations should be
updated after each use. Ridenour (1980) has recently argued that one
should consider as much information as possible when maintaining the
underlying scale in an item bank. One should not follow a set routine,
always either using the original item difficulty or updating the calibration
to the most recent value, but should consider all information and choose
the most appropriate value.

The purpose of this study is to investigate the cumulative effects of
three different item-bank maintenance procedures on a series of
longitudinal equatings of tests measuring the same characteristic. Many
testing programs are run on the basis of future tests being equal in
difficulty to the first test administered. Thus, it is important that the
item-bank maintenance procedure chosen not cause the difficulty
calibrations of the items to be artificially inflated or deflated, resulting in
succeeding tests being, in reality, easier or harder than the first.

Methodology

Data. During the fall of 1978, the Commonwealth of Virginia began its
Minimum Competency Testing Program with the administration of statewide
reading and mathematics tests to all tenth-grade students. To date,
there have been twelve administrations of these tests, either using new
equated editions or, in two cases, reusing earlier editions. The data for
this study came from selected administrations of the reading test. The
one-parameter latent-trait model (Rasch) has been used throughout the
program to obtain the item calibrations. BICAL-II (Wright, Mead, Bell,
1977) was employed for the first administration and BICAL-III (Wright,
Mead, Bell, 1979) for all subsequent administrations.



The calibrations on the Fall 1978 test (0001*) were made on a 10,000
student random sample drawn from a tested population of approximately
80,000 students. These sixty items and their difficulty value calibrations
on this administration constitute the initial item bank.

The subsequent administrations of the test chosen for the study are
the ones in which these sixty items appear again. For the purpose of
monitoring their calibrations, these items have been placed on operational
tests as experimental items. They were spread over seven forms on the
March 1980 administration (0003), over five forms on the March 1981
administration (0005), and over four forms on the March 1982
administration (0008). Using the operational items as the common core, a
number of test forms were used during the March administrations to monitor
the difficulty values of some older items and to gather statistical
information on newly acquired items. These forms are packaged in sequential
order, with packages beginning with as many different form numbers as forms
being administered. The tests are given to the students in the order in
which they appear in the package. There were eight test forms in the
March 1980 administration and twenty test forms in each of the March
1981 and 1982 administrations. The calibrations on these three
administrations (0003, 0005, 0008) were on a 10,000 student random sample
drawn from a tested population of approximately 19,000 on the March 1980
administration and approximately 80,000 tenth-grade students on the
March 1981 and March 1982 administrations. The entire original Fall
1978 test was reused for the October 1981 administration (0007). The
calibrations for this administration were on the entire population,
approximately 3,300, made up of twelfth-graders who either had failed the
test previously or were transfer students.

* Administration number

In addition to multiple calibrations of the items on the Fall 1978 test,
the sixty items that were used as operational items on the March 1980 test
have been used on the March 1981 and 1982 tests, a few as operational items
but the majority as experimental items, also for the purpose of monitoring
their calibrations.

Procedure. The sixty items from the first administration of the reading
test (0001) and the BICAL-II calibrations of these items constituted the
initial bank in this study. The items on subsequent administrations (0003,
0005, 0007, 0008) were linked to the bank in the order of their
administration. Items included in all linking procedures were those whose
difficulty value calibration on the form to be equated was not significantly
different from the difficulty value in the bank as tested by the evaluation
statistic given in Best Test Design (Wright and Stone, 1979). A computer
program (BLINK), developed in-house, has been used to link the items to the
existing bank. This program is based on the linking procedures for a
complete web discussed in the chapter "Constructing a Variable" in Best
Test Design (Wright and Stone, 1979).
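
The general idea of linking a new calibration to the bank through common
items can be sketched as follows. This is not the BLINK program, and it does
not reproduce the evaluation statistic from Best Test Design: as a stand-in,
it uses a generic standardized difference based on item standard errors and
removes the worst-fitting item one at a time. The function name, the
standard-error inputs, and the cutoff `z_crit` are all assumptions made for
illustration.

```python
import math


def link_to_bank(bank, new, se_bank, se_new, z_crit=2.0):
    """Estimate the constant that places a new calibration on the bank scale.

    bank, new       : dict item_id -> difficulty in logits
    se_bank, se_new : dict item_id -> standard error of that difficulty
    z_crit          : hypothetical cutoff for dropping an item from the link

    Returns (shift, kept_ids, removed_ids), where bank ~ new + shift.
    """
    kept = sorted(set(bank) & set(new))
    if not kept:
        raise ValueError("no common items to link on")
    removed = []

    def mean_shift(items):
        return sum(bank[i] - new[i] for i in items) / len(items)

    while len(kept) > 1:
        shift = mean_shift(kept)

        def z(i):
            # Standardized difference between bank value and shifted new value.
            residual = bank[i] - new[i] - shift
            return residual / math.sqrt(se_bank[i] ** 2 + se_new[i] ** 2)

        worst = max(kept, key=lambda i: abs(z(i)))
        if abs(z(worst)) <= z_crit:
            break
        kept.remove(worst)
        removed.append(worst)

    return mean_shift(kept), kept, removed
```

Items returned in `removed_ids` correspond to what the text calls items
taken off the link; what happens to their bank values afterwards is what
distinguishes the maintenance procedures.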

The item-bank maintenance procedures used in this investigation are
the following. In the first procedure (Method I), the difficulty values of
the items previously in the bank were updated to the new values after each
test administration wherein these items appeared. In the second procedure
(Method II), item bank difficulty values were not changed after use on
tests, but those items whose difficulty values were sufficiently modified
that they were not included on the link were permanently removed from
the bank.
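
Methods I and II can be expressed as two small update rules. The sketch
below assumes the bank is a mapping from item identifier to difficulty and
that new calibrations have already been placed on the bank scale; the
function and argument names are hypothetical.

```python
def update_bank_method_i(bank, new_on_bank_scale):
    """Method I: every item that reappears takes its newest value."""
    updated = dict(bank)
    updated.update(new_on_bank_scale)
    return updated


def update_bank_method_ii(bank, removed_from_link):
    """Method II: values never change; items taken off the link are deleted."""
    dropped = set(removed_from_link)
    return {item: d for item, d in bank.items() if item not in dropped}


# Hypothetical three-item bank (logits).
bank = {"a": 0.0, "b": 1.0, "c": -1.0}
print(update_bank_method_i(bank, {"a": 0.2}))
print(update_bank_method_ii(bank, ["b"]))
```

Both functions return a new mapping and leave the input bank untouched, so
the same starting bank can be carried through each procedure for comparison.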



In the third procedure (Method III), item bank difficulty values were not
altered after use in tests if these items were used on the link. If the
items were not employed in calculating the linking constant, because there
was a significant change between the difficulty value in the bank and the
one on the test to be equated, a decision concerning which difficulty value
would be put into the bank was then made. In Method IIIA (automatic update),
the difficulty values of the items taken off the link were automatically
updated to the difficulty values on the test to be equated and put into the
bank. In Method IIIB (studied/considered update), a determination
concerning the difficulty value to be put into the bank was made in the
following manner. If this calibration was the second calibration for the
item, then the average of the difficulty value in the bank and the present
calibration was the value placed in the bank for future use. If the item was
taken off the link on the third or subsequent calibration, the difficulty
values of the item obtained in each previous calibration were examined. If
two or more difficulty values were close, the average of these was used as
the new difficulty value placed in the bank. If all calibrations were
significantly different from each other, the item was then checked to see
if it had been rewritten and, if so, the latest calibration was placed in the
bank. If the item had not been modified, the fit statistics in the BICAL
calibrations for each administration were inspected. If the item had large
fit statistics on one calibration, then the average of the difficulty values
for the other two administrations was the value placed in the bank. If the
item had large fit statistics on two calibrations, then the difficulty value
of the third calibration was placed in the bank. If no basis could be found
for the large difference in difficulty values on these administrations, the
most recent difficulty value calibration was placed in the bank for future
use.
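
The decision sequence for items taken off the link under Methods IIIA and
IIIB can be sketched as below. The text does not give numeric criteria for
"close" difficulty values or "large" fit statistics, so `close_tol` and
`fit_crit` are placeholder assumptions, as are the function names. The fit
rule is also generalized slightly: the sketch averages whichever
calibrations have acceptable fit, which reduces to the two stated cases when
there are three calibrations.

```python
def method_iiia(bank_value, new_value):
    """Method IIIA: an item taken off the link simply takes the new value."""
    return new_value


def method_iiib(calibrations, fits, rewritten, close_tol=0.3, fit_crit=2.0):
    """Method IIIB value for an item taken off the link.

    calibrations : difficulties on the bank scale, oldest first, current last
    fits         : fit statistic for each calibration, same order
    rewritten    : True if the item text was rewritten
    close_tol, fit_crit : hypothetical thresholds (not given in the text)
    """
    latest = calibrations[-1]

    # Second calibration: average the bank value and the present value.
    if len(calibrations) == 2:
        return sum(calibrations) / 2.0

    # Third or later: average the largest group of mutually close values.
    values = sorted(calibrations)
    best = []
    for i, low in enumerate(values):
        group = [v for v in values[i:] if v - low <= close_tol]
        if len(group) > len(best):
            best = group
    if len(best) >= 2:
        return sum(best) / len(best)

    # All values differ: a rewritten item takes its latest calibration.
    if rewritten:
        return latest

    # Otherwise set aside calibrations with large fit statistics.
    good = [d for d, f in zip(calibrations, fits) if abs(f) < fit_crit]
    if 0 < len(good) < len(calibrations):
        return sum(good) / len(good)

    # No basis found for the differences: use the most recent value.
    return latest
```

Items that stay on the link keep their bank values under both variants, so
these functions would be called only for items removed from the linking
constant calculation.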


Data analysis. Several methods were employed to compare outcomes of
the different item-bank maintenance procedures. Comparisons were made
on the difficulty values of the items in the various banks resulting from the
linking procedures stated above.

Graphs of the difficulty value parameters of the items in the original
bank were made by plotting the values estimated in the original bank
against those in the bank at the end of the procedures.



Comparisons were made in the correlations of the difficulty value
parameters of the items in the original bank and the same items at the end
of the banking procedures. The means and standard deviations of these
items were also calculated and compared. Moreover, the means of the
difficulty values of all the items in the banks resulting from these different
procedures were compared, adjusting for the deletions made during the
Method II procedure.
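
The summary comparisons described here (means, standard deviations,
correlation, and regression slope) can be computed as in the sketch below.
It assumes sample standard deviations and a regression of the final bank
values on the original values; that direction agrees with Table 1, where
each slope is close to the correlation times the ratio of final to original
standard deviation (for example, .954 x 1.15 / 1.38 is about .795). The
function name and the example data are hypothetical.

```python
import math


def compare_banks(original, final):
    """Summary statistics for paired item difficulties (logits)."""
    n = len(original)
    if n != len(final) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(original) / n
    mean_y = sum(final) / n
    sxx = sum((x - mean_x) ** 2 for x in original)
    syy = sum((y - mean_y) ** 2 for y in final)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(original, final))
    return {
        "mean_original": mean_x,
        "mean_final": mean_y,
        "sd_original": math.sqrt(sxx / (n - 1)),  # sample SD (assumption)
        "sd_final": math.sqrt(syy / (n - 1)),
        "correlation": sxy / math.sqrt(sxx * syy),
        "slope": sxy / sxx,  # regression of final on original
    }


# Hypothetical difficulties, not study data.
stats = compare_banks([-1.0, 0.0, 1.0, 2.0], [-0.8, 0.1, 1.0, 1.9])
print(stats)
```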



Using the items that were placed on the March 1982 (0008) test and
their difficulty parameter values in each of the banks at the completion of
the above procedures, calculations were made to estimate ability
parameters. These calculations are similar to those in the BICAL-III
computer program except that the item difficulty values are held constant
while the maximum likelihood calculations are made for the ability values
only. Since the ability values of the group who actually were administered
the test (administration 0008) were known, comparisons of the various
effects for the different procedures could be made.
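
Estimating ability with item difficulties held fixed can be sketched with a
Newton-Raphson solution of the Rasch maximum likelihood equation, in which
the expected score at an ability equals the observed number correct. This is
a generic illustration, not the program used in the study, and it omits any
corrections BICAL-III may apply; the function names, starting value,
step limit, and tolerance are assumptions.

```python
import math


def ability_for_raw_score(raw, difficulties, tol=1e-6, max_iter=100):
    """Maximum likelihood Rasch ability (logits) for a number-correct score.

    Item difficulties are treated as known constants. Finite estimates
    exist only for scores strictly between 0 and the number of items.
    """
    n = len(difficulties)
    if not 0 < raw < n:
        raise ValueError("raw score must satisfy 0 < raw < number of items")
    # Starting value: mean difficulty plus the log-odds of the raw score.
    theta = sum(difficulties) / n + math.log(raw / (n - raw))
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(d - theta)) for d in difficulties]
        expected = sum(probs)
        information = sum(p * (1.0 - p) for p in probs)
        step = (raw - expected) / information
        step = max(-1.0, min(1.0, step))  # limit step size for stability
        theta += step
        if abs(step) < tol:
            break
    return theta


def score_table(difficulties):
    """Pair every non-extreme number-correct score with an ability."""
    n = len(difficulties)
    return {r: ability_for_raw_score(r, difficulties) for r in range(1, n)}
```

Running `score_table` once per bank on the same set of items gives one
score-to-ability table per maintenance procedure, which is the kind of
comparison described in the paragraph above.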

Results

Table 1 presents the means and standard deviations of the item
difficulty values of fifty-four out of the original sixty items that comprised
the original bank. (Six of the original items have been rewritten and, for
that reason, were not used in this study.) The procedure used in Method II
required that the item difficulty values remain approximately the same
and, if an item was not used in calculating the linking constant because
there was a significant difference between its value on the test being
equated and its value in the bank, the item was deleted from the bank for
all future use. Fourteen out of the original fifty-four items were deleted
from the bank in the linking processes in this method. The mean (in logits)
of the difficulty values of the fifty-four items was .004 with a standard
deviation of 1.38 in the original bank, .083 with a standard deviation of
1.27 at the end of the linking process using Method I, .112 with a standard
deviation of 1.15 using Method IIIA, and .022 with a standard deviation of
1.25 using Method IIIB.

Also in Table 1 are the correlations of the initial difficulty values of
these original fifty-four items with the difficulty values existing in the
banks after equating by means of Methods I, IIIA, and IIIB. The slopes of
the regression lines are also shown in Table 1.

An output of the computer program used in this study to obtain
ability scores is the pairing of the number of items correct score with an
ability score in logits. For these tests the passing score was set at
attaining at least 1.12 logits. Table 2 contains the ability scores necessary
for passing calibrated from the difficulty parameter values of the original
fifty-four items as they were in the original bank and as they were in the
banks at the end of the linking processes using Methods I, IIIA, and IIIB.
The ability score necessary for passing for the values in the original bank
was 1.158, at the end of Method I was 1.178, at the end of Method IIIA was
1.156, and at the end of Method IIIB was 1.222. Thus, to obtain a passing
score of at least 1.12 logits, Methods I and IIIA would give the same
number of items correct as was true in the original bank, but
Method IIIB would require one more correct item.

Figures 1-3 are graphs of the difficulty values of the original fifty-four
items with the values in the original bank plotted against the values in
the bank at the end of Methods I, IIIA, and IIIB. Figures 4-6 are graphs of
the ability scores calibrated using the difficulty values of the original
fifty-four items derived at the end of Methods I, IIIA, and IIIB plotted
against the ability values derived from using the original difficulty values.
Figures 7-9 are graphs of the ability scores derived from the difficulty
values at the end of Methods I, IIIA, and IIIB plotted against each other.

The means of the difficulty values of all 804 items in the final banks
by Methods I, IIIA, and IIIB and the means of the difficulty values of all
items in the final banks minus the values of the thirty-two items that were
deleted by Method II are presented in Table 3.

The items that comprised the test for administration 0008 were
selected from each of the banks resulting from Methods I, II, IIIA, and IIIB.
The means and standard deviations of the difficulty values of these items
in the corresponding banks are shown in Table 4. Also shown are the ability
passing scores, in logits, given that a value of at least 1.12 logits is
required. The number of items correct for the corresponding logit score is
also shown. Figures 10-15 are the graphs of the ability scores, calibrated
from the difficulty values of the 0008 administration items after they were
equated to the banks resulting from each of the methods, plotted against
each other. In Table 5 are the equating constants needed to equate the
test given in administration 0008 to the existing banks by each of the
methods. These constants range from -.617 to -.569.
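
The passing rule behind the ability passing scores reported in Tables 2 and
4 (the lowest number-correct score whose ability estimate is at least 1.12
logits) can be sketched as follows. The function name is hypothetical, and
the score table in the example is illustrative: only the pairing of 38 items
correct with 1.158 logits comes from Table 2, and the neighboring values are
invented for the example.

```python
def passing_raw_score(score_to_ability, cut=1.12):
    """Return (raw score, ability) for the lowest score meeting the cut."""
    passing = [raw for raw, theta in score_to_ability.items() if theta >= cut]
    if not passing:
        raise ValueError("no number-correct score reaches the cut")
    raw = min(passing)
    return raw, score_to_ability[raw]


# Illustrative fragment of a score-to-ability table.
table = {37: 1.10, 38: 1.158, 39: 1.22}
print(passing_raw_score(table))
```

Because number-correct scores are discrete, the ability at the passing score
can differ between banks while the number of items correct needed to pass
stays the same.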


In Table 6 are the numbers of items removed from the calculations for
deriving the linking constants for each of the methods in each of the links.
Since the first link was the same for all methods, the items removed from
the calculations were the same. However, for Method I the items taken off
the second link include nine items previously removed from the calculations
in the first link, those taken off the third link include seven items
previously deleted from the first two links, and those taken off the fourth
link include one item which was removed from the first two links and one
item removed from the third link. During the Method II process, if an item
was removed from the linking constant calculations, it was permanently
removed from the bank and thus entered into no further calculations. During
the calculations of the linking constants in Method IIIA there were nine
items taken off the second link that had also been removed from the first
link, six items taken off the third link that had been removed from the
first two links, and one item removed from the fourth link that had been
removed from the first and second links. For Method IIIB, in the second
link two items were removed that had been taken off the first link; in the
third link one item that had been removed from the first two links and one
item that had been taken off the first link were removed; and in the fourth
link one item that had been removed from the first link and one item that
had been taken off the first three links were taken out of the linking
calculations.



Discussion

The purpose of this study was to investigate the cumulative effects
of three different item-bank maintenance procedures on a series of
longitudinal equatings of tests measuring the same characteristic. One
way to compare these cumulative effects was to choose a set of items
comprising a test and to make various comparisons using both the difficulty
values in the banks resulting from the different linking processes and the
ability scores calibrated from these difficulty values.

For this study, the items used on the test for administration 0008
were chosen. The means of the difficulty values of these items in the
different banks at the end of the linking processes (Table 4) were
compared. These ranged from -.798 for Method IIIA to -.846 for Method
IIIB. Ability scores were calculated using the difficulty values found in the
banks resulting from the different processes. Since an ability score of at
least 1.12 logits must be attained in order to pass, the ability values needed
for a passing score ranged from 1.148 for Method IIIB to 1.195 for Method
IIIA. For each number of items correct score there is a corresponding
ability score. Although, for this test, there was variability in the ability
logit score, the number of items correct needed for passing was identical
for all methods. Figures 10-15 are graphs of the ability scores (at each
number of items correct value) calibrated from the difficulty values of the
items in this test at the end of the procedures. There was some variation
in the low and high ability scores but very little variation in the middle
scores. Although the procedures used in Methods IIIA and IIIB were the
most similar, the difficulty value means and the ability scores derived from
these difficulty values show the greatest difference.

The differences in the means of the difficulty values among the
methods and the differences in ability score logits necessary for the
passing score seem to indicate that, although at this stage there seems to
be no difference among the methods in the determination of a passing
score on a test, significant differences may show up in subsequent
longitudinal equatings. However, since the difficulty values of the items
chosen for the test produce passing ability scores which are equated to the
same number correct score, no comparisons can be made of the effects of
using each of the different procedures.

Another way to compare the cumulative effects of the different
item-bank maintenance procedures was to use the original fifty-four items
and compare the difficulty values and ability score values in the original
bank to those values at the end of Methods I, IIIA, and IIIB. Method II was
not included in these comparisons because, in this procedure, an item
either retained the difficulty value in the original bank or was deleted from
the bank permanently (as fourteen original items were) if the difficulty
value of the item on the test to be equated differed significantly from the
bank value.

The means of the difficulty values of these fifty-four items in the
original bank and at the end of the procedures are given in Table 1. The
means of the difficulty values at the end of the procedures range from .022
to .112 compared to .004 in the original bank. The values in the bank
following the Method IIIB procedure have a correlation of .990 with the
original bank values, while the correlations of the Methods I and IIIA
resulting values with the original bank values are .956 and .954
respectively. Figures 1-3 are graphs of the difficulty values of the original
fifty-four items at the end of Methods I, IIIA, and IIIB plotted against the
difficulty values of these items in the original bank. In Figure 1, except for
a few outliers, the difficulty values of the items are close to a line which
has a slope of 1 and a vertical-axis intercept of 0. In Methods IIIA and IIIB
the difficulty value of an item retained its original value unless the
difficulty value in the original bank and that on the test to be equated
differed so significantly that the value was not used in the link. In that
case, the item was given a new value. Comparing Figures 2 and 3, it is seen
that Method IIIA resulted in the new difficulty values being further from a
line with slope 1 and vertical-axis intercept of 0 than those given in
Method IIIB.

As seen in Table 2, Methods I and IIIA require the same number of
items correct score as was needed in the original bank, while Method IIIB
requires one additional item correct. Moreover, the ability passing score
resulting from the values in Method IIIA is the closest to that of the
original bank (1.156 to 1.158). Figure 4 shows that the ability scores (at
each number of items correct value) are less when derived from the
original bank difficulty values than they are when calibrated from those
difficulty values at the end of Method I. Figures 5 and 6 show that for
ability score values less than 1 and 0 respectively, the ability scores
derived from the original bank difficulty values are less than those derived
from the results of Methods IIIA and IIIB, but for larger ability scores, the
opposite situation is true.



Figures 7-9 show the graphs of ability scores derived from the
difficulty values of the original fifty-four items at the end of Methods I,
IIIA, and IIIB plotted against each other. For ability scores below 0, those
scores derived from Method I difficulty values are less than those
calibrated from Method IIIA values; however, for ability scores above 0,
the reverse is true. The ability scores derived from the Method I process
are greater than those calibrated from the difficulty values in Method IIIB,
while the ability scores derived from Method IIIA are greater than those
from Method IIIB.

Although the Method IIIA difficulty value mean differs more from the
original bank difficulty value than those resulting from Methods I and IIIB,
the Method IIIA ability score necessary for passing was the closest to the
original value. Method IIIB item difficulty values correlated to a greater
degree with the item difficulty values in the original bank than did those
resulting from Methods I or IIIA. However, the Method IIIB ability score
necessary for passing differed the most from that calibrated for the
original item bank and required a number correct score one greater than
Methods I, IIIA, and the original item bank. These results are not as
consistent as those obtained using the items of the test in administration
0008.



Table 3 contains both the mean item difficulty values of all the items
in the various banks and the mean item difficulty of all except the
thirty-two items that were deleted in the Method II process. The values
resulting from the Method IIIB procedure have the lowest mean. Those items
that were deleted in the Method II procedure had a smaller effect on Method
IIIA than on Methods I and IIIB.



From the preceding discussion it is clear that there are slight
differences in the item banking procedures for the original fifty-four
items. However, these differences did not manifest themselves
significantly when the difficulty values in the different banks of the items
chosen in such a manner as to comprise a test equated to the original Fall
1978 test were compared. For this reason it must be concluded that on the
basis of this study there are no significant differences in the different
item-bank maintenance procedures studied for longitudinal equating.
However, it may be that four longitudinal links are insufficient and that
differences in the procedures may become significant as the number of
links increases.
TABLE 1

Summary Statistics and Correlations of Difficulty Values
for the Fifty-four Original Bank Items

                          Original
                          Bank      Method I   Method II   Method IIIA   Method IIIB

Mean                      .004      .083       *           .112          .022

Standard deviation        1.38      1.27       *           1.15          1.25

Correlation with
  Original Bank Values    -         .956       *           .954          .990

Slope of Regression
  Line                    -         .881       *           .795          .894

*No values here because 14 items were deleted from the bank; all other items
retained their original bank difficulty values.


TABLE 2

Fifty-four Original Bank Items

                          Original
                          Bank      Method I   Method II   Method IIIA   Method IIIB

Number of Items Correct
  Needed for Passing      38        38         *           38            39

Ability Passing Score
  (Logits)                1.158     1.178      *           1.156         1.222


TABLE 3

Mean Difficulty Values of Items in Final Banks

                          Method I   Method II   Method IIIA   Method IIIB

All Items                 -.238      -           -.230         -.264

All Items Except the
  32 Deleted in Method II -.213      -.228       -.211         -.2[illegible]8


TABLE 4

Summary Statistics of All Items Used in Administration 0008

                          Method I   Method II   Method IIIA   Method IIIB

Mean of Difficulty Values -.803      -.838       -.798         -.846

Standard Deviation of
  Difficulty Values       1.43       1.43        1.43          1.43

Number of Items Correct
  Needed for Passing      49         49          49            49

Ability Passing Score
  (Logits)                1.190      1.156       1.195         1.148


TABLE 5

Equating Constants in Final Linkage

Method I          Method II   Method IIIA   Method IIIB

-.5[illegible]    -.609       -.569         -.617

TABLE 6

Number of Items Removed from Linking Calculations

Link      Method I   Method II   Method IIIA   Method IIIB

1         13         13          13            13
2         20          9          17            10
3          9          1           7             3
4         23          9          13            13

Total     65         32          50            39